The datasets were created using red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach.
# Getting the dimension of the dataset
dim(ds)
## [1] 1599 13
We have 13 variables and 1599 entries.
# Names of the columns
names(ds)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
We are unsure of the variable “X”. Let’s see what kind of values it contains.
ds[1:10,"X"]
## [1] 1 2 3 4 5 6 7 8 9 10
This column seems more like an index than anything else.
# Removing X from the dataframe
ds <- subset(ds, select = -X )
# Let's check for type of the variables
str(ds)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
We see that all variables except quality are numeric (continuous), whereas quality is an integer (discrete).
Here’s a description of the data.
Acids are major wine constituents and contribute greatly to its taste. In fact, acids impart the sourness or tartness that is a fundamental feature in wine taste. Wines lacking in acid are “flat.” Traditionally total acidity is divided into two groups, namely the volatile acids (see separate description) and the nonvolatile or fixed acids. Wines that are high in acidity taste sour.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution is right skewed. The median is 7.9, and values larger than 12.35 (the third quartile plus 1.5 times the interquartile range) are outliers. Most wines tend to have a moderate level of non-volatile acids.
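The 12.35 cutoff is consistent with the usual boxplot rule, Q3 + 1.5 × IQR. Here is a minimal sketch, assuming that rule is what the plot used. `upper_fence` is a hypothetical helper, and the quartiles are copied from the summary output above.

```r
# Upper outlier fence under the common boxplot rule: Q3 + k * IQR.
# upper_fence is a hypothetical helper, not part of the original analysis.
upper_fence <- function(q1, q3, k = 1.5) {
  q3 + k * (q3 - q1)
}

# Quartiles of fixed.acidity, taken from the summary output above.
fence <- upper_fence(q1 = 7.10, q3 = 9.20)
print(fence)
```

On the full data frame, the same quantity could presumably be obtained from `quantile(ds$fixed.acidity, 0.75) + 1.5 * IQR(ds$fixed.acidity)`.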
Volatile acidity refers to the steam distillable acids present in wine, primarily acetic acid. While acetic acid is generally considered a spoilage product (vinegar), some winemakers seek a low or barely detectable level of acetic acid to add to the perceived complexity of a wine. In addition, the production of acetic acid will result in the concomitant formation of other, sometimes unpleasant, aroma compounds.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution appears right skewed: most of the wines tested contain a low quantity of volatile acids, while some have high quantities. There are a few clear outliers.
Citric acid supplements are inexpensive and can be used by winemakers in acidification to boost the wine’s total acidity. Citric acid is used less frequently than tartaric and malic acid due to the aggressive citric flavors it can add to the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution follows no particular pattern. It is roughly uniform up to 0.5, after which its frequency falls.
Residual Sugar, or RS for short, refers to any natural grape sugars that are leftover after fermentation ceases (whether on purpose or not). The juice of wine grapes starts out intensely sweet, and fermentation uses up that sugar as the yeasts feast upon it. So if the wine has sugar you will probably want strong acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution is right skewed with a large number of outliers. It seems that the residual sugar is mainly concentrated around a low value.
The amount of salt is higher in wines coming from vineyards which are near the sea coast, which have brackish subsoil, or which have arid ground irrigated with salt water. The molar ratio of Cl-/Na+ therefore varies significantly and can even have a value close to one, which could imply the addition of salt (NaCl) to the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The distribution appears right skewed, with heavy outliers. The salt content also seems quite low.
In the wine industry, sulfur dioxide (SO2) is frequently added to must and juice as a preservative to prevent bacterial growth and slow down the process of oxidation by inhibiting oxidation enzymes. SO2 also improves the taste and retains the wine’s fruity flavors and freshness of aroma.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph seems right skewed.
Total sulfur dioxide is the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution appears right skewed, with a median of 38 (less than 50) and a mode of 20.
The density of the wine depends on the percent alcohol and sugar content.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph seems normally distributed.
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The graph seems normally distributed.
Sulphates are a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The graph seems right skewed.
The percent alcohol content of the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph does not follow a pattern.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The distribution appears roughly normal, with a mode of 5.
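Base R has no built-in function for the statistical mode, so one way to check such a claim is to tabulate the values. The sketch below uses a hypothetical helper, `stat_mode`, and only the first ten quality scores printed by `str(ds)` earlier, so it illustrates the technique rather than reproducing the full-data result.

```r
# Most frequent value of a discrete numeric vector.
# Ties are resolved in favour of the smallest value (first table entry).
# stat_mode is a hypothetical helper name.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# First ten quality scores, as printed by str(ds) earlier.
quality_head <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
print(stat_mode(quality_head))
```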
What is the structure of your dataset?
There are 1599 entries with 13 features (12 + 1 added as rating).
What is/are the main feature(s) of interest in your dataset?
There are quite a few, such as the balance of acidity, residual sugar and chlorides that shapes the taste of wine, as well as how other factors like density and pH vary.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
The various forms of acid (volatile, non-volatile and citric) may contribute to the features of interest. Alcohol may also affect pH.
Did you create any new variables from existing variables in the dataset?
Yes. I created a rating variable based on the existing feature called quality. 3-4: Bad, 5-6: OK, 7-8: Good.
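The binning described above can be sketched with `cut()`. The original code for this step is not shown, so the helper name `make_rating` and the exact break points are assumptions chosen to match the stated ranges.

```r
# Bin integer quality scores into the three rating labels described above.
# Intervals are right-closed: (2,4] -> Bad, (4,6] -> OK, (6,8] -> Good.
# make_rating is a hypothetical helper, not the author's original code.
make_rating <- function(quality) {
  cut(quality,
      breaks = c(2, 4, 6, 8),
      labels = c("Bad", "OK", "Good"),
      ordered_result = TRUE)
}

# Every quality score that occurs in the data (min 3, max 8).
quality_levels <- 3:8
rating <- make_rating(quality_levels)
print(data.frame(quality = quality_levels, rating = rating))
```

Applied to the data frame, this would be something like `ds$rating <- make_rating(ds$quality)`.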
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
There may not be any unusual distributions; however, some graphs do not follow any regular pattern. This may be due to some lurking variables. Since the data provided is promised to be clean, I did not perform any wrangling.
Before we dig into the bivariate plot analysis, it is worth looking at the correlation matrix, which will help us identify potentially related variables. These pairs may then be interesting to analyze.
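One way to pull candidate pairs out of a correlation matrix is sketched below. `strong_pairs` and the 0.3 threshold are assumptions for illustration, and the toy data frame holds only the first ten rows printed by `str(ds)`, so its coefficients will differ from the full-data values reported later.

```r
# List variable pairs whose absolute Pearson correlation reaches a threshold.
# strong_pairs is a hypothetical helper; the threshold is an arbitrary choice.
strong_pairs <- function(df, threshold = 0.3) {
  m <- cor(df, method = "pearson")
  # Keep each pair once by looking only at the upper triangle.
  idx <- which(upper.tri(m) & abs(m) >= threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r = m[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first.
  out[order(-abs(out$r)), ]
}

# First ten rows of a few columns, copied from the str(ds) output.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

candidate_pairs <- strong_pairs(wine_head, threshold = 0.3)
print(candidate_pairs)
```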
Some potential positively related variable pairs are:
Some potential negatively related variable pairs are:
Let us now visualize how the above pairs will look when plotted.
## [1] "The co-relation coefficient is 0.671703434764106"
This trend is expected, as citric acid is added to increase total acidity. So where citric acid is high, the non-volatile (fixed) acidity should also be high.
## [1] "The co-relation coefficient is 0.668047292118974"
This is an interesting correlation. We know that wines with high residual sugar also tend to have high acid content, making them more appealing to the taste buds. Also, density and residual sugar are positively correlated, which helps explain the above plot.
## [1] "The co-relation coefficient is 0.364947175211251"
This is a weak correlation. Based on the previous plot, it suggests that high-density wines may have a high natural acid content (not artificially added citric acid).
## [1] "The co-relation coefficient is 0.355283370983376"
As discussed before, one of the factors behind high density is residual sugar. This graph supports that, although the correlation coefficient is fairly low.
## [1] "The co-relation coefficient is 0.476166324001136"
This moderate positive correlation suggests that wines with a higher percentage of alcohol tend to be rated as better quality.
## [1] "The co-relation coefficient is -0.341699334785031"
This is again expected. The highest fixed acidity goes with the lowest pH values and vice versa.
## [1] "The co-relation coefficient is -0.55249568455958"
To boost the wine’s total acidity, either citric acid or volatile acid is added (not both at the same time, but as a trade-off), so there is a clear trend here. Also, volatile acidity mostly stays within 0.6.
## [1] "The co-relation coefficient is -0.54190414473951"
This is again expected. The highest citric acid content goes with the lowest pH values and vice versa.
## [1] "The co-relation coefficient is -0.496179770241702"
Alcohol is lighter than water, that is, its density is less than 1. A higher alcohol content therefore lowers the density of the wine, which is what the graph shows.
## [1] "The co-relation coefficient is -0.390557780264007"
Volatile acidity is undesirable as it induces a bad taste. So, less volatile acid means better quality wine and vice versa.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Density shows a positive trend with both acidity and residual sugar, which is as expected since these chemicals add to the density. Also, a greater quantity of alcohol is associated with better wine quality.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Yes, such as the inverse relationship between pH and acidity, and between quality and volatile acidity (high volatile acidity imparts a bad taste).
What was the strongest relationship you found?
Fixed acidity vs citric acid is the strongest relationship that I found, with a Pearson coefficient of 0.67.
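For reference, the Pearson coefficient can be computed by hand and checked against R's built-in `cor()`. This sketch uses only the first ten rows printed by `str(ds)`, so it will not reproduce the 0.67 obtained on all 1599 wines. `pearson_r` is a hypothetical helper name.

```r
# Pearson correlation from its definition:
# r = sum((x - mean(x)) * (y - mean(y))) /
#     sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# First ten values of each column, copied from the str(ds) output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
citric_acid   <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

r_manual  <- pearson_r(fixed_acidity, citric_acid)
r_builtin <- cor(fixed_acidity, citric_acid, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))
```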
Let us see how quality is dispersed among citric acid and fixed acidity.
We can see from this diagram that the darker shades are in the bottom section of the graph, making it clear that bad quality wines have low to medium contents of both acids. That also suggests good winemakers add citric acid in the right proportions to pull up the acidic content.
We can see that the good quality wines have low density but high acidic contents.
Again we see that good wines or lighter blues have low density but high residual sugar content.
Wow! The nicest trend. The bad quality wines have less alcohol content and are denser, while the opposite (high alcohol and less dense) holds for the good wines, which tend to have a balanced pH (preferably low).
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
What we suspected from the start about the presence of acidity and residual sugar in good quality wines is supported. Also, good quality wines have more alcohol and are less dense, with comparatively low pH.
Were there any interesting or surprising interactions between features?
Winemakers tend to use citric acid to increase the overall acidity rather than relying on the inherent non-volatile acidity.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
No.
This is the most promising plot supporting the argument we have made from the very beginning: “Good wines have high alcohol content and lower density with medium pH.”
We can see that the clustering of good wines is near a place where the density is low but the acidity is high. This supports our previous argument that good wines have high acidic content.
This supports the argument that winemakers use citric acid to pull up the acidic content of the wine in cases where volatile acidity is low. So, if the natural acidity is in good proportion, artificial citric acid is not added.
This dataset has details of 1599 wines described by twelve features, from around 2009. I first did single-variable analysis on the dataset, to understand the basic features of the wines and take the first step in a skeptical data exploration journey. Next I examined the interesting variables in pairs and noted the possible trends, trying to identify the reasons which determine the quality of wines.
Finally, I ended up with the argument that good wines have a high alcohol content, lower density and medium pH.
The main hurdle I faced while developing the project was that I was getting a single color whose tone varied across the different qualities of wine. This was challenging to interpret, as similar colors are hard to distinguish. So I ended up converting the discrete variable quality to a factor for plotting. The next problem was that the six colors were still not visually appealing. So I created a new variable, rating, and based on the quality rated the wines as Good, OK or Bad. This made things easier to interpret and look cool.
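Converting quality to a factor, as described above, might look like the following sketch. The level range 3 to 8 comes from the summary of quality shown earlier; the sample vector is just the first ten scores from `str(ds)`, and the variable names are assumptions.

```r
# First ten quality scores, as printed by str(ds) earlier.
quality_head <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Treat quality as an ordered categorical variable so that a plotting
# library assigns one distinct color per level instead of a continuous
# gradient. Levels span the observed range of quality (min 3, max 8).
quality_factor <- factor(quality_head, levels = 3:8, ordered = TRUE)

print(levels(quality_factor))
print(table(quality_factor))
```

With six levels the palette can still be hard to read, which is the motivation for collapsing quality into the three-level rating.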
This exploration could be extended by using functions such as SelectKBest to find which features contribute most to predicting the quality of wine. We could then build one or two classifiers to predict wine quality.
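SelectKBest comes from Python's scikit-learn. A rough R analogue, not the same implementation, is to rank predictors by their absolute correlation with quality and keep the top k. The helper `top_k_by_correlation` below is hypothetical, and the toy data frame contains only the first ten rows printed by `str(ds)`, so the ranking is illustrative only.

```r
# Rank predictors by absolute Pearson correlation with the target and
# return the k highest scores (named by predictor).
# top_k_by_correlation is a hypothetical helper, not a library function.
top_k_by_correlation <- function(df, target, k = 3) {
  predictors <- setdiff(names(df), target)
  scores <- sapply(predictors, function(p) abs(cor(df[[p]], df[[target]])))
  head(sort(scores, decreasing = TRUE), k)
}

# First ten rows of a few columns, copied from the str(ds) output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

best <- top_k_by_correlation(wine_head, target = "quality", k = 2)
print(best)
```

The selected columns could then be passed to whichever classifier is chosen for the modelling step.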